136 research outputs found

    SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications

    Summary: The Smith-Waterman (SW) algorithm, which produces the optimal pairwise alignment between two sequences, is frequently used as a key component of fast heuristic read mapping and variation detection tools, but current implementations are either designed as monolithic protein database searching tools or are embedded into other tools. To facilitate easy integration of the fast Single Instruction Multiple Data (SIMD) SW algorithm into third-party software, we wrote a C/C++ library, which extends Farrar's Striped SW (SSW) to return alignment information in addition to the optimal SW score. Availability: SSW is available both as a C/C++ software library and as a stand-alone alignment tool wrapping the library's functionality at https://github.com/mengyao/Complete-Striped-Smith-Waterman-Library Contact: [email protected] Comment: 3 pages, 2 figures

    Graphical pangenomics

    Completely sequencing genomes is expensive, and to save costs we often analyze new genomic data in the context of a reference genome. This approach distorts our image of the inferred genome, an effect which we describe as reference bias. To mitigate reference bias, I repurpose graphical models previously used in genome assembly and alignment to serve as a reference system in resequencing. To do so I formalize the concept of a variation graph to link genomes to a graphical model of their mutual alignment that is capable of representing any kind of genomic variation, both small and large. As this model combines both sequence and variation information in one structure it serves as a natural basis for resequencing. By indexing the topology, sequence space, and haplotype space of these graphs and developing generalizations of sequence alignment suitable to them, I am able to use them as reference systems in the analysis of a wide array of genomic systems, from large vertebrate genomes to microbial pangenomes. To demonstrate the utility of this approach, I use my implementation to solve resequencing and alignment problems in the context of Homo sapiens and Saccharomyces cerevisiae. I use graph visualization techniques to explore variation graphs built from a variety of sources, including diverged human haplotypes, a gut microbiome, and a freshwater viral metagenome. I find that variation-aware read alignment can eliminate reference bias at known variants, and this is of particular importance in the analysis of ancient DNA, where existing approaches result in significant bias towards the reference genome and concomitant distortion of population genetics results. I validate that the variation graph model can be applied to align RNA sequencing data to a splicing graph. Finally, I show that a classical pangenomic inference problem in microbiology can be solved using a resequencing approach based on variation graphs. Wellcome Trust PhD fellowship

    Haplotype-aware graph indexes

    The variation graph toolkit (VG) represents genetic variation as a graph. Each path in the graph is a potential haplotype, though most paths are unlikely recombinations of true haplotypes. We augment the VG model with haplotype information to identify which paths are more likely to be correct. For this purpose, we develop a scalable implementation of the graph extension of the positional Burrows-Wheeler transform. We demonstrate the scalability of the new implementation by indexing the 1000 Genomes Project haplotypes. We also develop an algorithm for simplifying variation graphs for k-mer indexing without losing any k-mers in the haplotypes.

    A profile in FIRE: resolving the radial distributions of satellite galaxies in the Local Group with simulations

    While many tensions between Local Group (LG) satellite galaxies and LCDM cosmology have been alleviated through recent cosmological simulations, the spatial distribution of satellites remains an important test of physical models and physical versus numerical disruption in simulations. Using the FIRE-2 cosmological zoom-in baryonic simulations, we examine the radial distributions of satellites with Mstar > 10^5 Msun around 8 isolated Milky Way- (MW) mass host galaxies and 4 hosts in LG-like pairs. We demonstrate that these simulations resolve the survival and physical destruction of satellites with Mstar >~ 10^5 Msun. The simulations broadly agree with LG observations, spanning the radial profiles around the MW and M31. This agreement does not depend strongly on satellite mass, even at distances <~ 100 kpc. Host-to-host variation dominates the scatter in satellite counts within 300 kpc of the hosts, while time variation dominates scatter within 50 kpc. More massive host galaxies within our sample have fewer satellites at small distances, likely because of enhanced tidal destruction of satellites via the baryonic disks of host galaxies. Furthermore, we quantify and provide fits to the tidal depletion of subhalos in baryonic relative to dark matter-only simulations as a function of distance. Our simulated profiles imply observational incompleteness in the LG even at Mstar >~ 10^5 Msun: we predict 2-10 such satellites to be discovered around the MW and possibly 6-9 around M31. To provide cosmological context, we compare our results with the radial profiles of satellites around MW analogs in the SAGA survey, finding that our simulations are broadly consistent with most SAGA systems. Comment: 18 pages, 10 figures, plus appendices. Main results in figures 2, 3, and 4. Accepted version

    Genomic diversity and novel genome-wide association with fruit morphology in Capsicum, from 746k polymorphic sites

    Capsicum is one of the major vegetable crops grown worldwide. Current subdivision in clades and species is based on morphological traits and coarse sets of genetic markers. Broad variability of fruits has been driven by breeding programs and has been mainly studied by linkage analysis. We discovered 746k variable sites by sequencing 1.8% of the genome in a collection of 373 accessions belonging to 11 Capsicum species from 51 countries. We describe genomic variation at population-level, confirm major subdivision in clades and species, and show that the known major subdivision of C. annuum separates large and bulky fruits from small ones. In C. annuum, we identify four novel loci associated with phenotypes determining the fruit shape, including a non-synonymous mutation in the gene Longifolia 1-like (CA03g16080). Our collection covers all the economically important species of Capsicum widely used in breeding programs and represents the widest and largest study so far in terms of the number of species and number of genetic variants analyzed. We identified a large set of markers that can be used for population genetic studies and genetic association analyses. Our results provide a comprehensive and precise perspective on genomic variability in Capsicum at population-level and suggest that future fine genetic association studies will yield useful results for breeding.

    The distribution and mutagenesis of short coding INDELs from 1,128 whole exomes

    BACKGROUND: Identifying insertion/deletion polymorphisms (INDELs) with high confidence has been intrinsically challenging in short-read sequencing data. Here we report our approach for improving INDEL calling accuracy by using a machine learning algorithm to combine call sets generated with three independent methods, and by leveraging the strengths of each individual pipeline. Utilizing this approach, we generated a consensus exome INDEL call set from a large dataset generated by the 1000 Genomes Project (1000G), maximizing both the sensitivity and the specificity of the calls. RESULTS: This consensus exome INDEL call set features 7,210 INDELs, from 1,128 individuals across 13 populations included in the 1000 Genomes Phase 1 dataset, with a false discovery rate (FDR) of about 7.0%. CONCLUSIONS: In our study we further characterize the patterns and distributions of these exonic INDELs with respect to density, allele length, and site frequency spectrum, as well as the potential mutagenic mechanisms of coding INDELs in humans. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12864-015-1333-7) contains supplementary material, which is available to authorized users.